LLM Evaluation Framework

Replicate the Hugging Face Open LLM Leaderboard Locally

LLM Evaluation
Author

Matthias De Paolis

Published

May 1, 2025

Image

The discontinuation of Hugging Face’s Open LLM Leaderboard has left a gap in the community for standardized evaluation of large language models (LLMs). To address this, I developed the LLM Evaluation Framework, a comprehensive and modular tool designed to facilitate reproducible and extensible benchmarking of LLMs across various tasks and benchmarks.

The LLM Evaluation Framework can be found on my GitHub account: LLM Evaluation Framework

🧩 Why This Framework Matters

The Open LLM Leaderboard was instrumental in providing a centralized platform for evaluating and comparing LLMs. Its retirement has underscored the need for tools that allow researchers and developers to conduct their own evaluations with transparency and consistency. The LLM Evaluation Framework aims to fill this void by offering:

- Modular Design: Inspired by microservice architecture, enabling easy integration and customization.
- Multiple Model Backends: Support for Hugging Face (hf) and vLLM backends, allowing flexibility in model loading and inference.
- Quantization Support: Evaluate quantized models (e.g., 4-bit, 8-bit with hf, AWQ with vLLM) to assess performance under resource constraints.
- Comprehensive Benchmarks: Includes support for standard benchmarks like MMLU, GSM8K, BBH, and more.
- Leaderboard Replication: Easily run evaluations mimicking the Open LLM Leaderboard setup with standardized few-shot settings.
- Flexible Configuration: Customize evaluations via CLI arguments or programmatic usage.
- Detailed Reporting: Generates JSON results and Markdown reports for easy analysis.
- Parallelism: Leverages vLLM for efficient inference, including tensor parallelism across multiple GPUs.

🚀 Getting Started

Installation

  1. Clone the Repository:

git clone https://github.com/mattdepaolis/llm-evaluation.git
cd llm-evaluation

  2. Set Up a Virtual Environment:

python -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

  3. Install Dependencies:

pip install -e lm-evaluation-harness
pip install torch numpy tqdm transformers accelerate bitsandbytes sentencepiece
pip install vllm  # If you plan to use the vLLM backend
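Before running your first evaluation, it can help to confirm that the core dependencies resolved correctly. The snippet below is a minimal, standard-library-only check; the package list simply mirrors the install commands above and can be adjusted to your setup:

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]

# Packages from the install step above; add "vllm" if you use that backend
required = ["torch", "numpy", "tqdm", "transformers", "accelerate"]
missing = missing_packages(required)
print("Missing:", missing if missing else "none -- environment looks good")
```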

🧪 Example Usage

Using the Command-Line Interface (CLI)

Evaluate a model on the HellaSwag benchmark:

python llm_eval_cli.py \
  --model hf \
  --model_name google/gemma-2b \
  --tasks hellaswag \
  --num_fewshot 0 \
  --device cuda  # Use 'cpu' if you don't have a GPU

This command will download the gemma-2b model (if not cached), run it on the HellaSwag benchmark with 0 few-shot examples, and save the results in the results/ and reports/ directories.
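Because each run writes its own JSON file under results/, a small helper makes it easy to pick up the latest output for inspection. This is a sketch using only the standard library; the exact file-naming scheme inside results/ is an assumption:

```python
import json
from pathlib import Path

def latest_results(results_dir="results"):
    """Load the most recently modified JSON results file, if any.

    Returns a (path, data) tuple, or (None, None) if the directory
    contains no JSON files.
    """
    files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not files:
        return None, None
    path = files[0]
    with open(path) as f:
        return path, json.load(f)
```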

Using as a Python Library

Integrate the evaluation logic directly into your Python scripts:

from llm_eval import evaluate_model
import os

# Define evaluation parameters
eval_config = {
    "model_type": "hf",
    "model_name": "google/gemma-2b-it",
    "tasks": ["mmlu", "gsm8k"],
    "num_fewshot": 0,
    "device": "cuda",
    "quantize": True,
    "quantization_method": "4bit",
    "batch_size": "auto",
    "output_dir": "./custom_results"  # Optional: Specify output location
}

# Run the evaluation
try:
    results_summary, results_file_path = evaluate_model(**eval_config)

    print("Evaluation completed successfully!")
    print(f"Results summary: {results_summary}")
    print(f"Detailed JSON results saved to: {results_file_path}")

    # Construct the expected report path
    base_name = os.path.splitext(os.path.basename(results_file_path))[0]
    report_file_path = os.path.join(os.path.dirname(results_file_path).replace('results', 'reports'), f"{base_name}_report.md")

    if os.path.exists(report_file_path):
        print(f"Markdown report saved to: {report_file_path}")
    else:
        print("Markdown report not found at expected location.")

except Exception as e:
    print(f"An error occurred during evaluation: {e}")

📊 Reporting and Results

The framework generates:

- JSON Results: Detailed results for each task, including individual sample predictions (if applicable), metrics, and configuration details, saved in the results/ directory.
- Markdown Reports: A summary report aggregating scores across tasks, generated in the reports/ directory.
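The JSON output follows the lm-evaluation-harness convention of a top-level "results" mapping from task name to metric values; the helper below flattens that into rows for quick tabulation. The exact structure assumed here may vary slightly between harness versions:

```python
def summarize_results(results):
    """Flatten a harness-style results dict into (task, metric, value) rows."""
    rows = []
    for task, metrics in results.get("results", {}).items():
        for metric, value in metrics.items():
            # Skip non-numeric entries such as metric aliases
            if isinstance(value, (int, float)):
                rows.append((task, metric, value))
    return rows

# Example with a minimal, hand-made results dict
example = {"results": {"hellaswag": {"acc": 0.55, "acc_norm": 0.71}}}
for task, metric, value in summarize_results(example):
    print(f"{task:12s} {metric:10s} {value:.3f}")
```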

📄 How the Evaluation Report Looks

When you run an evaluation using the LLM Evaluation Framework, it generates comprehensive yet easy-to-understand reports in both Markdown and JSON formats. Here’s a broad overview of what you can expect from the Markdown report:

1. 📊 Summary of Metrics

This section provides a clear table summarizing the evaluation results for each individual task. Each row includes:

  • Task: The specific benchmark or task evaluated (e.g., leaderboard_bbh_boolean_expressions).
  • Metric: The evaluation metric used (e.g., accuracy, exact match).
  • Value: The model’s performance score on that task.

2. 📈 Normalized Scores

This section provides normalized scores, giving you an easy-to-interpret percentage representation of the model’s performance relative to the benchmark standards. It includes:

  • Benchmark: The benchmark’s name.
  • Score: The normalized percentage score.

This helps quickly identify the relative strengths and weaknesses of the evaluated model.
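The Open LLM Leaderboard rescaled raw accuracies against each benchmark's random-guessing baseline, so 0% means chance-level performance and 100% means a perfect score. A minimal sketch of that rescaling (the baseline value in the example is illustrative):

```python
def normalize_score(raw, baseline):
    """Rescale raw accuracy so the random baseline maps to 0 and 1.0 to 100."""
    if not 0.0 <= baseline < 1.0:
        raise ValueError("baseline must be in [0, 1)")
    return max(0.0, (raw - baseline) / (1.0 - baseline)) * 100.0

# e.g. a 4-way multiple-choice task has a random baseline of ~0.25,
# so a raw accuracy halfway between chance and perfect normalizes to 50
print(normalize_score(0.625, 0.25))  # -> 50.0
```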

3. 🔍 Task Samples (Detailed Examples)

The report also offers detailed, human-readable examples from evaluated tasks, allowing you to qualitatively assess the model’s outputs:

  • Question: Clearly presents the evaluation sample question.
  • Ground Truth: The correct answer or expected response.
  • Model Response: The exact response provided by your evaluated model, clearly labeled as correct or incorrect.

This section is especially valuable for error analysis and understanding how your model handles specific types of queries.

⚙️ Customization

These reports can be customized or extended further by modifying the reporting logic, enabling deeper analyses or alternative formats as needed.

🔧 Extending the Framework

The modular design makes it easier to add new functionalities:

  1. Adding New Tasks/Benchmarks:
  • Define the task configuration in llm_eval/tasks/task_registry.py or a similar configuration file.
  • Ensure the task is compatible with the lm-evaluation-harness structure or adapt it.
  2. Supporting New Model Backends:
  • Create a new model handler class in llm_eval/models/ inheriting from a base model class (if applicable).
  • Implement the required methods for loading, inference, etc.
  • Register the new backend type.
  3. Customizing Reporting:
  • Modify the report generation logic in llm_eval/reporting/ to change the format or content of the Markdown/JSON outputs.
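As a concrete illustration of adding a backend, a new handler might look like the sketch below. All class and method names here are hypothetical; the actual base class and registration hook live in llm_eval/models/ and may differ:

```python
from abc import ABC, abstractmethod

class ModelBackend(ABC):
    """Hypothetical base class a new backend would inherit from."""

    @abstractmethod
    def load(self, model_name: str, **kwargs) -> None:
        """Load model weights and tokenizer."""

    @abstractmethod
    def generate(self, prompt: str, **kwargs) -> str:
        """Run inference on a single prompt."""

class EchoBackend(ModelBackend):
    """Trivial backend that echoes the prompt -- useful as a wiring test."""

    def load(self, model_name: str, **kwargs) -> None:
        self.model_name = model_name

    def generate(self, prompt: str, **kwargs) -> str:
        return f"[{self.model_name}] {prompt}"
```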

🤝 Contributing

Contributions are welcome! Please follow standard practices:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix (git checkout -b feature/my-new-feature).
  3. Make your changes and commit them (git commit -am 'Add some feature').
  4. Push to the branch (git push origin feature/my-new-feature).
  5. Create a new Pull Request.